1 Networks 1

Obi

1.1 Learning Objectives

What are the main takeaways from today?

  • Defining graphs: nodes, edges and the adjacency matrix.
  • Directed vs undirected graphs
  • Connected graphs
  • Calculating node degree
  • Shortest paths and diameter
  • Performing the above calculations on the London Tube network

Possible useful resource: https://kateto.net/wp-content/uploads/2016/01/NetSciX_2016_Workshop.pdf

1.2 Sample data

“A network is, in its simplest form, a collection of points joined together in pairs by lines” M.E.J. Newman, Networks: An Introduction

There are many systems that can be usefully thought of as networks. For example:

  • The internet (computers joined together by data connections)
  • Food webs (animals and plants joined by predator-prey relationships)
  • Transport networks (intersections joined by roads or stations joined by trains)
  • Economic networks (businesses joined by transactions)

Consider some hypothetical Twitter data, where each row records who that person follows.

dfTwitter <- data.frame("Bonnie" = c(0,0,0,1,1), 
                        "Matt" = c(1,0,0,1,0),
                        "Elsa" = c(0,1,0,0,1),
                        "Neave" = c(1,0,0,0,0),
                        "Obi" = c(0,1,1,0,0),
                        row.names = c("Bonnie", "Matt", "Elsa", "Neave", "Obi"))

dfTwitter
##        Bonnie Matt Elsa Neave Obi
## Bonnie      0    1    0     1   0
## Matt        0    0    1     0   1
## Elsa        0    0    0     0   1
## Neave       1    1    0     0   0
## Obi         1    0    1     0   0

From this we see that Bonnie follows Matt and Neave, Matt follows Elsa and Obi, etc…

Now create an iGraph graph object from this data

library(igraph)
graphTwitter <- graph_from_adjacency_matrix(as.matrix(dfTwitter))
plot(graphTwitter)

Some important definitions to introduce here are:

Graph

A Graph is a mathematical representation of a network, with n nodes and m edges, as seen above. In this example, the nodes represent Twitter accounts and the edges represent following relationships between the accounts.

Nodes (Vertices)

V(graphTwitter)
## + 5/5 vertices, named, from a8a6bc7:
## [1] Bonnie Matt   Elsa   Neave  Obi

Edges

E(graphTwitter)
## + 9/9 edges from a8a6bc7 (vertex names):
## [1] Bonnie->Matt   Bonnie->Neave  Matt  ->Elsa   Matt  ->Obi   
## [5] Elsa  ->Obi    Neave ->Bonnie Neave ->Matt   Obi   ->Bonnie
## [9] Obi   ->Elsa

An equivalent representation of this network is the edge list, with edges pointing from the nodes in column 1 to the nodes in column 2.

get.edgelist(graphTwitter)
##       [,1]     [,2]    
##  [1,] "Bonnie" "Matt"  
##  [2,] "Bonnie" "Neave" 
##  [3,] "Matt"   "Elsa"  
##  [4,] "Matt"   "Obi"   
##  [5,] "Elsa"   "Obi"   
##  [6,] "Neave"  "Bonnie"
##  [7,] "Neave"  "Matt"  
##  [8,] "Obi"    "Bonnie"
##  [9,] "Obi"    "Elsa"

Adjacency matrix

Another equivalent, and frequently used representation of a network is the adjacency matrix.

In iGraph the convention for the adjacency matrix elements is as follows (some textbooks use the reverse convention):

\[A_{ij} = \begin{cases} 1, & \text{if there is an edge from node i to j}\\ 0, & \text{otherwise} \end{cases}\]

In this case, the dataframe used to create the network is essentially the adjacency matrix:

get.adjacency(graphTwitter)
## 5 x 5 sparse Matrix of class "dgCMatrix"
##        Bonnie Matt Elsa Neave Obi
## Bonnie      .    1    .     1   .
## Matt        .    .    1     .   1
## Elsa        .    .    .     .   1
## Neave       1    1    .     .   .
## Obi         1    .    1     .   .

\(A_{Bonnie,Matt} = 1\)

graphTwitter['Bonnie','Matt'] # as.matrix(get.adjacency(graphTwitter))['Bonnie','Matt'] also works 
## [1] 1
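The equivalence between the adjacency matrix and the edge list can be seen directly in base R. A minimal sketch using the same hypothetical follower data:

```r
# Adjacency matrix of the hypothetical Twitter network (rows = sources)
A <- matrix(c(0,1,0,1,0,
              0,0,1,0,1,
              0,0,0,0,1,
              1,1,0,0,0,
              1,0,1,0,0),
            nrow = 5, byrow = TRUE,
            dimnames = rep(list(c("Bonnie","Matt","Elsa","Neave","Obi")), 2))

# Each non-zero entry A[i, j] corresponds to one directed edge i -> j
idx <- which(A == 1, arr.ind = TRUE)
edgelist <- cbind(from = rownames(A)[idx[, "row"]],
                  to   = colnames(A)[idx[, "col"]])
edgelist  # nine edges, matching get.edgelist(graphTwitter) up to ordering
```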

Directed Graph

This is a directed graph, meaning that there may be an edge from node i to node j without a corresponding edge from node j to node i.

Directed graphs have adjacency matrices that are not symmetric

isSymmetric(get.adjacency(graphTwitter))
## [1] FALSE

Bonnie is connected to Matt but Matt is not connected to Bonnie

print(graphTwitter['Bonnie','Matt'])
## [1] 1
print(graphTwitter['Matt','Bonnie'])
## [1] 0

1.3 Sample analysis

With the information about our social network represented as a Graph, we can perform some analysis.

Node degree

Undirected graph: the number of links attached to a node \(k_i = \sum_{j=1}^{n}A_{ij} = \sum_{j=1}^{n}A_{ji}\)

Directed graph: either the number of ‘in’ links or ‘out’ links attached to a node.

In degree: number of links pointing to you \(k_{i}^{in} = \sum_{j=1}^{n}A_{ji}\)

Out degree: number of links you point to \(k_{i}^{out} = \sum_{j=1}^{n}A_{ij}\)

In degree of the Twitter Graph

degree(graphTwitter, mode = "in")
## Bonnie   Matt   Elsa  Neave    Obi 
##      2      2      2      1      2

Out degree

degree(graphTwitter, mode = "out")
## Bonnie   Matt   Elsa  Neave    Obi 
##      2      2      1      2      2
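These degrees can be checked directly on the adjacency matrix: with the iGraph convention \(A_{ij} = 1\) for an edge from i to j, out-degree is a row sum and in-degree is a column sum. A base-R sketch on the same data:

```r
# Adjacency matrix of the hypothetical Twitter network (rows = sources)
A <- matrix(c(0,1,0,1,0,
              0,0,1,0,1,
              0,0,0,0,1,
              1,1,0,0,0,
              1,0,1,0,0),
            nrow = 5, byrow = TRUE,
            dimnames = rep(list(c("Bonnie","Matt","Elsa","Neave","Obi")), 2))

k_out <- rowSums(A)  # number of accounts each person follows
k_in  <- colSums(A)  # number of followers each person has

k_in   # matches degree(graphTwitter, mode = "in"):  2 2 2 1 2
k_out  # matches degree(graphTwitter, mode = "out"): 2 2 1 2 2
```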

Paths

A path is a sequence of nodes such that each node is connected to the next node. For example: [‘Bonnie’, ‘Neave’, ‘Matt’, ‘Elsa’].

The shortest path between two nodes is the path connecting them with the fewest edges (or, in a weighted graph, the smallest total weight).

[‘Bonnie’, ‘Neave’, ‘Matt’, ‘Elsa’] is a path from Bonnie to Elsa but the shortest path is…

shortest_paths(graphTwitter, 'Bonnie', to = 'Elsa', 
               mode = "out", weights = NULL, output = c("vpath", "epath", "both"),
               predecessors = FALSE, inbound.edges = FALSE)
## $vpath
## $vpath[[1]]
## + 3/5 vertices, named, from a8a6bc7:
## [1] Bonnie Matt   Elsa  
## 
## 
## $epath
## NULL
## 
## $predecessors
## NULL
## 
## $inbound_edges
## NULL

or use mode=“in” to find the shortest path from the ‘to’ node to the ‘from’ node

shortest_paths(graphTwitter, 'Bonnie', to = 'Elsa', 
               mode = "in", weights = NULL, output = "vpath",
               predecessors = FALSE, inbound.edges = FALSE)
## $vpath
## $vpath[[1]]
## + 3/5 vertices, named, from a8a6bc7:
## [1] Bonnie Obi    Elsa  
## 
## 
## $epath
## NULL
## 
## $predecessors
## NULL
## 
## $inbound_edges
## NULL
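Under the hood, shortest paths in an unweighted graph can be found with a breadth-first search. A minimal base-R sketch on the same adjacency matrix (iGraph's implementation is more general, handling weights and multiple targets):

```r
# Breadth-first search returning one shortest directed path from `from` to `to`
bfs_path <- function(A, from, to) {
  prev <- setNames(rep(NA_character_, nrow(A)), rownames(A))
  queue <- from
  visited <- from
  while (length(queue) > 0) {
    v <- queue[1]; queue <- queue[-1]
    for (w in colnames(A)[A[v, ] == 1]) {   # follow edge directions ("out")
      if (!(w %in% visited)) {
        prev[w] <- v
        visited <- c(visited, w)
        queue <- c(queue, w)
      }
    }
  }
  if (!(to %in% visited)) return(NULL)      # no directed path exists
  path <- to
  while (path[1] != from) path <- c(unname(prev[path[1]]), path)
  path
}

A <- matrix(c(0,1,0,1,0, 0,0,1,0,1, 0,0,0,0,1,
              1,1,0,0,0, 1,0,1,0,0), nrow = 5, byrow = TRUE,
            dimnames = rep(list(c("Bonnie","Matt","Elsa","Neave","Obi")), 2))
bfs_path(A, "Bonnie", "Elsa")  # Bonnie -> Matt -> Elsa, as found by shortest_paths()
```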

Connected

A Graph is connected if it is possible to reach any node from any other node. Directed graphs can be ‘strongly’ or ‘weakly’ connected.

  • Strongly connected: it is possible to reach every node from every other node whilst following the directions of the edges
  • Weakly connected: it is possible to reach every node only if the directions of the edges are disregarded (treat as undirected graph)

The Twitter graph is strongly connected (which means it is also weakly connected)

is.connected(graphTwitter, mode = "strong")
## [1] TRUE
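Strong connectivity can also be verified by hand: the graph is strongly connected if and only if every node is reachable from every other node along directed edges. Since \((A^k)_{ij} > 0\) exactly when there is a directed walk of length \(k\) from i to j, summing powers of A up to \(n-1\) gives a reachability matrix. A base-R sketch:

```r
# Adjacency matrix of the hypothetical Twitter network (rows = sources)
A <- matrix(c(0,1,0,1,0, 0,0,1,0,1, 0,0,0,0,1,
              1,1,0,0,0, 1,0,1,0,0), nrow = 5, byrow = TRUE,
            dimnames = rep(list(c("Bonnie","Matt","Elsa","Neave","Obi")), 2))
n <- nrow(A)

# R[i, j] > 0 iff j is reachable from i in at most n-1 steps (I counts 0 steps)
R <- diag(n)
P <- diag(n)
for (k in 1:(n - 1)) {
  P <- P %*% A   # P is now A^k
  R <- R + P
}
all(R > 0)  # TRUE: the graph is strongly connected
```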

Diameter

The diameter of the network is the length of the longest shortest path.

diameter(graphTwitter)
## [1] 3

For a directed graph that is only weakly connected the diameter is infinite (some ordered node pairs cannot be joined by a directed path) and so a slightly different definition of diameter must be used.

1.4 Real World Example

Use GIS data of the London Underground to produce a graph representing the tube network.

  • Stations -> Nodes
  • Connections between stations -> Edges

Read the shapefile data

Linestrings with each end being the coordinates of a station

library(rgdal)

dir_data <- ".\\data\\underground"
sfTube <- readOGR(dsn = dir_data, "underground")
## OGR data source with driver: ESRI Shapefile 
## Source: "S:\CASA_obits_ucfnoth\4. Collaborations\SS-ASM19-Materials\networks\data\underground", layer: "underground"
## with 410 features
## It has 6 fields
## Integer64 fields read as strings:  station_1 station_2

The data contain line strings between the coordinates of tube stations

head(sfTube@lines)
## [[1]]
## An object of class "Lines"
## Slot "Lines":
## [[1]]
## An object of class "Line"
## Slot "coords":
##         [,1]    [,2]
## [1,] -0.1571 51.5226
## [2,] -0.1631 51.5225
## 
## 
## 
## Slot "ID":
## [1] "0"
## 
## 
## [[2]]
## An object of class "Lines"
## Slot "Lines":
## [[1]]
## An object of class "Line"
## Slot "coords":
##         [,1]    [,2]
## [1,] -0.1571 51.5226
## [2,] -0.1466 51.5234
## 
## 
## 
## Slot "ID":
## [1] "1"
## 
## 
## [[3]]
## An object of class "Lines"
## Slot "Lines":
## [[1]]
## An object of class "Line"
## Slot "coords":
##         [,1]    [,2]
## [1,] -0.1247 51.5080
## [2,] -0.1223 51.5074
## 
## 
## 
## Slot "ID":
## [1] "2"
## 
## 
## [[4]]
## An object of class "Lines"
## Slot "Lines":
## [[1]]
## An object of class "Line"
## Slot "coords":
##         [,1]    [,2]
## [1,] -0.1247 51.5080
## [2,] -0.1342 51.5098
## 
## 
## 
## Slot "ID":
## [1] "3"
## 
## 
## [[5]]
## An object of class "Lines"
## Slot "Lines":
## [[1]]
## An object of class "Line"
## Slot "coords":
##         [,1]    [,2]
## [1,] -0.1679 51.5199
## [2,] -0.1631 51.5225
## 
## 
## 
## Slot "ID":
## [1] "4"
## 
## 
## [[6]]
## An object of class "Lines"
## Slot "Lines":
## [[1]]
## An object of class "Line"
## Slot "coords":
##         [,1]    [,2]
## [1,] -0.1679 51.5199
## [2,] -0.1755 51.5154
## 
## 
## 
## Slot "ID":
## [1] "5"

and attributes of the linestrings

head(sfTube@data)
##   toid_seq station_1       station_1_ station_2       station_2_ distance
## 0        1        11     Baker Street       163       Marylebone 416.5861
## 1        2        11     Baker Street       212    Regent's Park 734.1736
## 2        3        49    Charing Cross        87       Embankment 179.5034
## 3        4        49    Charing Cross       197 Picadilly Circus 689.2898
## 4        5        82 Edgware Road (B)       163       Marylebone 441.2181
## 5        6        82 Edgware Road (B)       193       Paddington 727.2998

Create a graph from the data

Use the attribute data from the shape file to create a graph representing the tube network.

dfTube <-  sfTube@data
dfTubeGraph <- data.frame(i = dfTube$station_1, j = dfTube$station_2, distance = dfTube$distance)
graphTube <- graph_from_data_frame(dfTubeGraph, directed = FALSE)
head(get.edgelist(graphTube))
##      [,1] [,2] 
## [1,] "11" "163"
## [2,] "11" "212"
## [3,] "49" "87" 
## [4,] "49" "197"
## [5,] "82" "163"
## [6,] "82" "193"

Add node attribute to the graph

Process the data to get the coordinates, names, and ids of all stations

# fortify (from ggplot2) extracts the x and y coordinates of the points that make up the linestrings and the id of the linestring the points belong to
sfTube.f=fortify(sfTube)

# Create id column that matches the ids of the linestrings
dfTube$id <- as.numeric(as.character(dfTube$toid_seq)) - 1

# Join the coordinates to the data from the shape file using the id column
dfMergedTube <- merge(sfTube.f, dfTube, by.x = "id", by.y = "id")

# Separate out the first and second stations and stack these on top of one another in one long dataframe.
firstStation <- dfMergedTube[dfMergedTube$order == 1,]
secondStation <- dfMergedTube[dfMergedTube$order == 2,]

firstStation <- data.frame(id=firstStation$station_1, name = firstStation$station_1_, lat = firstStation$lat, long = firstStation$long)
secondStation <- data.frame(id=secondStation$station_2, name = secondStation$station_2_, lat = secondStation$lat, long = secondStation$long)

dfStations <- rbind(firstStation, secondStation) %>% unique()
head(dfStations)
##     id               name     lat    long
## 1   11       Baker Street 51.5226 -0.1571
## 3  114 Harrow & Wealdston 51.5925 -0.3351
## 4    3       Aldgate East 51.5154 -0.0726
## 6   15            Barking 51.5396  0.0810
## 8   17       Barons Court 51.4905 -0.2139
## 10  18          Bayswater 51.5121 -0.1879

Find the index at which each vertex appears in the dataframe of stations and use this to assign names and positions to the vertices

# Since the vertices don't have names yet, V(graphTube)$name returns a vector of vertex ids
v_pos = match(V(graphTube)$name, as.character(dfStations$id))

# Use the vector of positions to get the node names and coordinates ordered the same as they are in the graph
vector_names <- dfStations$name[v_pos]
vector_coords <- dfStations[v_pos,]

graphTube <- set.vertex.attribute(graphTube, "name", index = V(graphTube), value = as.character(vector_names))
graphTube <- set.vertex.attribute(graphTube, "x", index = V(graphTube), value = as.numeric(vector_coords$long))
graphTube <- set.vertex.attribute(graphTube, "y", index = V(graphTube), value = as.numeric(vector_coords$lat))

Nodes now all have name, x, and y attributes

print(get.vertex.attribute(graphTube, "name", 11))
## [1] "Lambeth North"
print(get.vertex.attribute(graphTube, "x", 11))
## [1] -0.1115
print(get.vertex.attribute(graphTube, "y", 11))
## [1] 51.4991

1.5 Numeric Results

This graph is undirected and connected.

print(isSymmetric(get.adjacency(graphTube)))
## [1] TRUE
print(is.connected(graphTube))
## [1] TRUE

The graph includes an edge attribute for each edge - ‘distance’ - taken from the distance between stations

get.edge.attribute(graphTube)$distance[1:10]
##  [1]  416.5861  734.1736  179.5034  689.2898  441.2181  727.2998  954.9616
##  [8]  698.0515 1537.3971  783.2256

Show the adjacency matrix with and without weights (topological vs weighted)

For cases where there are multiple edges between nodes, the inclusion of weights can be misleading.

get.adjacency(graphTube)[1:10,1:10]
## 10 x 10 sparse Matrix of class "dgCMatrix"
##    [[ suppressing 10 column names 'Baker Street', 'Charing Cross', 'Edgware Road (B)' ... ]]
##                                       
## Baker Street       . . . . . . . . . .
## Charing Cross      . . . . 2 . . . . .
## Edgware Road (B)   . . . . . . . . . .
## Elephant & Castle  . . . . . . . . . .
## Embankment         . 2 . . . . . . . .
## Harlesden          . . . . . . . . . .
## Harrow & Wealdston . . . . . . . . 1 .
## Kensal Green       . . . . . . . . . .
## Kenton             . . . . . . 1 . . .
## Kilburn Park       . . . . . . . . . .
get.adjacency(graphTube, attr = "distance")[1:10,1:10]
## 10 x 10 sparse Matrix of class "dgCMatrix"
##    [[ suppressing 10 column names 'Baker Street', 'Charing Cross', 'Edgware Road (B)' ... ]]
##                                                                   
## Baker Street       .   .      . .   .      .    .     .    .     .
## Charing Cross      .   .      . . 359.0068 .    .     .    .     .
## Edgware Road (B)   .   .      . .   .      .    .     .    .     .
## Elephant & Castle  .   .      . .   .      .    .     .    .     .
## Embankment         . 359.0068 . .   .      .    .     .    .     .
## Harlesden          .   .      . .   .      .    .     .    .     .
## Harrow & Wealdston .   .      . .   .      .    .     . 1785.108 .
## Kensal Green       .   .      . .   .      .    .     .    .     .
## Kenton             .   .      . .   .      . 1785.108 .    .     .
## Kilburn Park       .   .      . .   .      .    .     .    .     .

From this we see there are two connections between stations 49 and 87. Check this in the data.

dfTube[dfTube$station_1 == 49,]
##     toid_seq station_1    station_1_ station_2       station_2_ distance
## 2          3        49 Charing Cross        87       Embankment 179.5034
## 3          4        49 Charing Cross       197 Picadilly Circus 689.2898
## 276      277        49 Charing Cross        87       Embankment 179.5034
## 277      278        49 Charing Cross       151 Leicester Square 436.4846
##      id
## 2     2
## 3     3
## 276 276
## 277 277

Degree distribution

Calculate degree distribution of the tube network
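For the tube network the degree vector comes from igraph's degree(); the distribution itself is just a frequency table of that vector. A base-R sketch on a small hypothetical degree vector (for the real network, substitute `deg <- degree(graphTube)`):

```r
# Hypothetical degree vector; for the tube network use: deg <- degree(graphTube)
deg <- c(1, 2, 2, 3, 2, 4, 1, 2, 3, 2)

# Empirical degree distribution p(k): the fraction of nodes with degree k
pk <- table(deg) / length(deg)
pk

# Visualise with e.g. barplot(pk, xlab = "degree k", ylab = "p(k)")
```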

Shortest paths and diameter

A connected graph is one in which every node can be reached by every other node.

Calculate the shortest path between nodes (crude since the weights are distances and not travel times)

startNode = V(graphTube)["Stockwell"]
endNode = V(graphTube)["Mile End"]
shortest_paths(graphTube, startNode, to = endNode, weights = get.edge.attribute(graphTube)$distance, output = "vpath")
## $vpath
## $vpath[[1]]
## + 12/306 vertices, named, from a942d50:
##  [1] Stockwell         Oval              Kennington       
##  [4] Elephant & Castle Borough           London Bridge    
##  [7] Bank              Liverpool Street  Aldgate East     
## [10] Whitechapel       Stepney Green     Mile End         
## 
## 
## $epath
## NULL
## 
## $predecessors
## NULL
## 
## $inbound_edges
## NULL

Calculate the diameter with and without using the distance as the weight.

print(diameter(graphTube, directed = FALSE, unconnected = FALSE, weights = NULL))
## [1] 38
print(diameter(graphTube, directed = FALSE, unconnected = FALSE, weights = get.edge.attribute(graphTube)$distance))
## [1] 71491.12

Plot the graph

graphTube=simplify(graphTube,remove.loops = T,remove.multiple = T,edge.attr.comb = "min")
plot(graphTube,vertex.size=3,vertex.color="red",vertex.label.cex=.1)

1.6 Further Reading

Resources from a workshop on networks and iGraph: https://kateto.net/wp-content/uploads/2016/01/NetSciX_2016_Workshop.pdf

Rmarkdown Cheatsheet

2 Networks 2

Elsa Arcaute

2.1 Learning Objectives

  • Introduce three centrality measures: degree, closeness and betweenness centrality
  • Calculate these centrality measures for a social network
  • Introduce Assortativity and modularity
  • Introduce hierarchical clustering for community detection
  • Apply hierarchical clustering to the karate club dataset

2.2 Centrality

Nodes within a network can play different roles and have different levels of importance. Centrality measures help determine what these roles are. In this sense, there is no unique measure of importance, since importance is determined with respect to the specific situation or problem at hand. For example, one person might be very important in leading a group of people, while another is key in mediating conflict, and so forth. Some examples of centrality measures are:

  • Degree centrality
  • Eigenvector centrality
  • Katz Centrality
  • Closeness centrality
  • Betweenness centrality

Within this section we will introduce degree, closeness and betweenness centrality.

Degree centrality

Degree was introduced in the previous section (1.3). Let us take an example of a social network, and see how it works.

I asked my then 8-year-old son Yaan to construct a small network of the people important to his dad David. How? Build a list of person 1 | person 2 | weight, where the weight is a number between 1 and 10 reflecting the importance of the relationship, 10 being the highest.

#file constructed by Yaan
file_network <- ".\\data\\network_David.csv"

#read and get network
dat=read.csv(file_network,header=TRUE)
head(dat)
##    node1 node2 weight
## 1   Yaan David     10
## 2   Yaan  Elsa     10
## 3   Yaan   Abu     10
## 4   Elsa   Abu     10
## 5  David   Abu      9
## 6 Maxime  Yaan     10
g_david=graph.data.frame(dat,directed=FALSE)

#let us plot it
plot(g_david)

#Plotting the network in this way does not give us enough information
#Now let us compute the degree centrality
deg_david=degree(g_david, v=V(g_david), loops = F, normalized = FALSE)

print(deg_david)
##       Yaan       Elsa      David     Maxime    Valeria Fernandito 
##         24         23         23          5          5         11 
##   Fernando      Belli        Isi       Paul     Ximena       Pame 
##         11         11          7          7          7         11 
##      Rocio        Fer    Esteban     Jaquie        Ana       Anna 
##         11         11         11         12          3          4 
##    Stephan      Marco      Leigh      Trond      Storm   Marcello 
##          7          4          6          6          6          4 
##        Abu 
##         22
#we can plot the graph using the measure of degree
plot(g_david,vertex.size=deg_david/max(deg_david)*14,vertex.color=deg_david,vertex.label.cex=0.5)
title(main = "Social network of David seen by Yaan")

In this plot we can see who is the most connected person, any surprises?

Now recall the previous section in which you learnt that there’s ‘in’ and ‘out’ degree. Who is more important?

What if, instead of looking at how many connections you have, you look at how important those connections are? –> Eigenvector centrality. More elaborate algorithms, which look not only at the importance of your connections but also at the connections pointing to you, and so on, are the ones used to rank internet pages, for example PageRank.

Closeness centrality

Let \(d_{ij}\) be the geodesic distance (shortest path) between node \(i\) and \(j\). The mean geodesic distance of node \(i\) to all other nodes is given by: \[l_{i} = \frac{1}{n}\sum_{j}d_{ij}\] where \(n\) is the total number of nodes.

A node that is very close to most nodes, and hence has a low mean geodesic distance, will be more influential than a node which is far away. This gives rise to the following definition for the closeness centrality of node \(i\): \[C_{i} = \frac{1}{l_{i}} = \frac{n}{\sum_{j}d_{ij}}\]

Let us compute the closeness centrality for each node in David’s network. Note that when using software it is important that you always read the documentation: for example, in the case of iGraph in R, if not specified otherwise, the measure is not divided by the total number of nodes. If normalised, it is normalised by \(n-1\).

The link to the documentation is here: https://igraph.org/r/doc/closeness.html

# given that the network is weighted let us introduce the weights.
v_weights=1/(E(g_david)$weight)

# Note that we inverted the weights, since they are meant to represent distance.
# the higher the value, the closer they are
clos_david=closeness(g_david,weights = v_weights)

# the following two commands are just to choose colours for the nodes, a palette
normalised_clos_david=(clos_david-min(clos_david))/(max(clos_david)-min(clos_david))
palette=hsv(h=1-((normalised_clos_david*2/3)+1/3),s = 1,v=1)

# Now we can plot it
plot(g_david,vertex.color=palette,vertex.size=normalised_clos_david*15,vertex.label.cex=.5)
title(main = "Closeness centrality of David's network according to Yaan")

And now who is the most important person in the network according to closeness centrality? Is this surprising?

The measure computed above took the weights of the network into account. This means that the path between two nodes that are not directly connected might be shorter than the direct link between two nodes with a very low “friendship” number. If the weights are disregarded, the measure corresponds to shortest paths computed simply as the number of links between nodes.

#topological closeness
clos_top_david=closeness(g_david, weights = NA)
normalised_clos_top_david=(clos_top_david-min(clos_top_david))/(max(clos_top_david)-min(clos_top_david))
palette_top=hsv(h=1-((normalised_clos_top_david*2/3)+1/3),s = 1,v=1)
plot(g_david,vertex.color=palette_top,vertex.size=normalised_clos_top_david*15,vertex.label.cex=.5)
title(main = "Topological closeness centrality")

Betweenness centrality

The betweenness centrality of a vertex is the number of shortest paths, over all pairs of nodes, that pass through it. Edge betweenness is defined similarly, counting the shortest paths that pass along an edge.

Define \(n^{i}_{st}\) as: \[n^{i}_{st} = \begin{cases} 1, & \text{if vertex }i\text{ lies on the geodesic path from }s\text{ to }t\\ 0, & \text{otherwise} \end{cases}\]

Then betweenness centrality can be defined as: \[x_{i} = \sum_{st}n^{i}_{st}\]

However, there may be multiple geodesics from \(s\) to \(t\) so to account for this we normalise by the number of geodesics from \(s\) to \(t\), \(g_{st}\): \[x_{i} = \sum_{st}\frac{n^{i}_{st}}{g_{st}}\]

#Let us compute the betweenness centrality for the network 
bet_david=betweenness(g_david, v=V(g_david), directed = F, normalized = FALSE, weights = (E(g_david)$weight))
palette=hsv(h=1-((bet_david/max(bet_david)*2/3)+1/3),s = 1,v=1)
plot(g_david,vertex.color=palette,vertex.size=bet_david/max(bet_david)*18,vertex.label.cex=.5)
title(main = "Broker of David's network seen by Yaan")

This is a rather nice result: the grandma is the broker of the system!

Note that such a nice result fooled us. Why? What needs to be amended?

Let us look at the topological betweenness and see whether we get any insights:

#Topological betweenness centrality for the network 
bet_david_top=betweenness(g_david, v=V(g_david), directed = F, normalized = FALSE, weights = NA)
palette_top=hsv(h=1-((bet_david_top/max(bet_david_top)*2/3)+1/3),s = 1,v=1)
plot(g_david,vertex.color=palette_top,vertex.size=bet_david_top/max(bet_david_top)*18,vertex.label.cex=.5)
title(main = "Broker (topological) of David's network seen by Yaan")

There are two lessons to learn from this:

  • assigning arbitrary numbers to relationships can lead to misleading results

  • we need to be careful when using weights: what are they encoding? how are they representing the system? how can we integrate them correctly?

2.3 Assortativity and Modularity

In order to talk about clustering in networks, one needs to introduce a set of concepts, such as the clustering coefficient, modularity, assortativity, etc., which need more than the few remaining minutes that we have. We will therefore give you “a taster” of this field here, since community detection is a field in its own right within network science.

Assortativity, disassortativity

include_graphics(".\\img\\networks2\\assort.png")

include_graphics(".\\img\\networks2\\assort2.png")

Let us look at another example where we have no more information on the nodes than what you see below.

include_graphics(".\\img\\networks2\\assort_2nets_plus.png")

include_graphics(".\\img\\networks2\\assort_null.png")

include_graphics(".\\img\\networks2\\modularity.png")

include_graphics(".\\img\\networks2\\comm_detection.png")

2.4 Hierarchical clustering

There are two types of hierarchical clustering algorithm: Agglomerative and Divisive

–> Both methods are based on assigning a measure to each link, say a similarity measure between the nodes it connects, or a centrality measure for the link itself. Then, depending on the algorithm, either nodes connected by high-value links (e.g. high similarity) are merged, or low-value links (e.g. low similarity) are removed.

The process for hierarchical clustering is as follows:

  1. Define a similarity measure, or a centrality measure

  2. Compute the similarity or centrality measure for the network

  3. Iteratively merge nodes or remove links until all nodes are merged, or all links are removed

  4. Build a dendrogram of the process

  5. Identify the cut in the tree that maximises the modularity
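The steps above can be sketched with base R's stats::hclust on a toy similarity matrix (an agglomerative illustration on hypothetical data; the Girvan-Newman algorithm below is the divisive counterpart, and would additionally pick the cut maximising modularity):

```r
# Toy pairwise similarities between 4 hypothetical nodes (1 = very similar)
S <- matrix(c(1.0, 0.9, 0.2, 0.1,
              0.9, 1.0, 0.3, 0.2,
              0.2, 0.3, 1.0, 0.8,
              0.1, 0.2, 0.8, 1.0),
            nrow = 4, dimnames = rep(list(c("a", "b", "c", "d")), 2))

# Agglomerative clustering merges the most similar nodes first,
# so convert similarity to distance
d <- as.dist(1 - S)
tree <- hclust(d, method = "average")

# plot(tree) draws the dendrogram; cutree chooses where to cut the hierarchy
cutree(tree, k = 2)  # a, b in one community; c, d in the other
```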

The following slides, taken from Barabasi’s lecture on the Girvan-Newman algorithm, illustrate the method very clearly

include_graphics(".\\img\\networks2\\G_N1.png")

include_graphics(".\\img\\networks2\\G_N2.png")

Karate Club network

In this section we apply some of the community detection algorithms available in iGraph to Zachary’s Karate club network. Look it up: this is the most famous network for testing community detection algorithms.

#It's so famous it's already in iGraph
g_karate <- make_graph("Zachary")
#let us now produce a plot such that the nodes in the plot have a size related to their degree
V(g_karate)$size=degree(g_karate)
plot(g_karate,vertex.label=NA)
title('Karate club friendship network', cex.main = 1.25,line=2)
title('Zachary 1977', cex.main = 1, line=1)

Girvan-Newman method

M. Girvan and M. E. J. Newman, Proc. Natl. Acad. Sci. U.S.A. 99, 7821 (2002)

M. E. Newman and M. Girvan, Physical Review E 69, 026113 (2004)

iGraph: edge.betweenness.community

http://igraph.org/r/doc/community.edge.betweenness.html

http://www.inside-r.org/packages/cran/igraph/docs/edge.betweenness.community

How does it work?

  • Hierarchical divisive algorithm: remove edges (links) according to their edge betweenness values (in decreasing order).

  • Removal stops when modularity of the partition is maximised.

–> Edges with high betweenness are most likely the ones connecting different groups, and are therefore removed first.

Performance

  • Good

  • slow: O(N³): betweenness needs to be re-calculated each time an edge is removed

Weighted graphs? Yes

Directed graphs? Yes

#------------
#Girvan-Newman algorithm
#------------
    alg_name='Girvan-Newman'
    GN_comm <- edge.betweenness.community(g_karate)
    #look at result of algorithm
    print(GN_comm)
## IGRAPH clustering edge betweenness, groups: 5, mod: 0.4
## + groups:
##   $`1`
##    [1]  1  2  4  8 12 13 14 18 20 22
##   
##   $`2`
##   [1]  3 25 26 28 29 32
##   
##   $`3`
##   [1]  5  6  7 11 17
##   
##   $`4`
##   + ... omitted several groups/vertices
    #the process aims at maximising the modularity
    modularity(GN_comm)
## [1] 0.4012985
    # number of communities and their sizes
    nc=length(GN_comm)
    sizes(GN_comm)
## Community sizes
##  1  2  3  4  5 
## 10  6  5 12  1
    #if you need the membership vector only
    mem_vec=membership(GN_comm)
    #visualise the results
    plot(GN_comm,g_karate,layout=layout.fruchterman.reingold)
    title(paste0(alg_name,' algorithm \n',nc,' communities for the karate club'))

    #The division into communities is determined by the choice of location for cutting  the hierarchical tree.
    #The threshold is selected such that the modularity is maximised
    #Let us look at this through the dendrogram
    dendPlot(GN_comm,use.modularity = TRUE)
    title(paste0('Community structure dendrogram for ',alg_name,' method \n',nc,' communities for the karate club'))

# !!!!!! This method is based on computing betweenness in an iterative way, and hence it is extremely slow for large networks. 

Fast Greedy modularity optimization: Clauset, Newman and Moore

Clauset, M. E. J. Newman, and C. Moore, Phys. Rev. E 70, 066111 (2004)

http://www.arxiv.org/abs/cond-mat/0408187

iGraph: fastgreedy.community

http://igraph.org/r/doc/fastgreedy.community.html

http://www.inside-r.org/packages/cran/igraph/docs/fastgreedy.community

How does it work?

  • Assign each node to its own community, no links between communities

  • Inspect each pair of communities connected by a link, and merge them if doing so maximally increases the Girvan-Newman modularity.

  • Bottom-up instead of top-down: communities are merged iteratively such that each merge is locally optimal

Performance

  • Not as good as Girvan-Newman

  • fast: O(N log²N)

  • Resolution limit

Weighted graphs? Yes

Directed graphs? No

#------------
#Fast greedy modularity optimisation: Clauset-Newman-Moore
#------------
    alg_name='Clauset-Newman-Moore'
    greedy_c <- fastgreedy.community(g_karate)
    #look at result of algorithm
    print(greedy_c)
## IGRAPH clustering fast greedy, groups: 3, mod: 0.38
## + groups:
##   $`1`
##   [1]  1  5  6  7 11 12 17 20
##   
##   $`2`
##    [1]  9 15 16 19 21 23 24 25 26 27 28 29 30 31 32 33 34
##   
##   $`3`
##   [1]  2  3  4  8 10 13 14 18 22
## 
    #the process aims at maximising the modularity
    modularity(greedy_c)
## [1] 0.3806706
    # number of communities and their sizes
    nc=length(greedy_c)
    sizes(greedy_c)
## Community sizes
##  1  2  3 
##  8 17  9
    #if you need the membership vector only
    mem_vec=membership(greedy_c)
    #visualise the results
    plot(greedy_c,g_karate)#,layout=layout.fruchterman.reingold)
    title(paste0(alg_name,' algorithm \n',nc,' communities for the karate club'))

    #The division into communities is determined by the choice of location for cutting  the hierarchical tree.
    #The threshold is selected such that the modularity is maximised
    #Let us look at this through the dendrogram
    dendPlot(greedy_c,use.modularity = TRUE)
    title(paste0('Community structure dendrogram for ',alg_name,' method \n',nc,' communities for the karate club'))

Random walk: Pons & Latapy

Pascal Pons, Matthieu Latapy: Computing communities in large networks using random walks, http://arxiv.org/abs/physics/0512106

iGraph: walktrap.community

How does it work?

  • This function tries to find densely connected subgraphs

  • short random walks tend to stay in the same community

  • merge separate communities in a bottom-up manner
  • can use the modularity score to select where to cut the dendrogram

Performance

  • Slower than greedy
  • More accurate than greedy

Weighted graphs? Yes

Directed graphs? No

#------------
#Community strucure via short random walks: Pons&Latapy
#------------
    alg_name='Random walks'
    wtrap_c <- walktrap.community(g_karate)
    comm=wtrap_c
    #look at result of algorithm
    print(comm)
## IGRAPH clustering walktrap, groups: 5, mod: 0.35
## + groups:
##   $`1`
##   [1]  1  2  4  8 12 13 18 20 22
##   
##   $`2`
##   [1]  3  9 10 14 29 31 32
##   
##   $`3`
##   [1] 15 16 19 21 23 27 30 33 34
##   
##   $`4`
##   + ... omitted several groups/vertices
    #the process aims at maximising the modularity
    modularity(comm)
## [1] 0.3532216
    # number of communities and their sizes
    nc=length(comm)
    sizes(comm)
## Community sizes
## 1 2 3 4 5 
## 9 7 9 4 5
    #if you need the membership vector only
    mem_vec=membership(comm)
    #visualise the results
    plot(comm,g_karate)#,layout=layout.fruchterman.reingold)
    title(paste0(alg_name,' algorithm \n',nc,' communities for the karate club'))

    #The division into communities is determined by the choice of location for cutting  the hierarchical tree.
    #The threshold is selected such that the modularity is maximised
    #Let us look at this through the dendrogram
    dendPlot(comm,use.modularity = TRUE)
    title(paste0('Community structure dendrogram for ',alg_name,' method \n',nc,' communities for the karate club'))

Leading eigenvector: Newman spectral approach

MEJ Newman: Finding community structure using the eigenvectors of matrices, Physical Review E 74 036104, (2006)

iGraph: leading.eigenvector.community

http://igraph.org/r/doc/leading.eigenvector.html

http://www.inside-r.org/packages/cran/igraph/docs/leading.eigenvector.community

How does it work?

  • Method based on the spectral properties of graphs: if communities are well identified, the eigenvector components for nodes in the same community will have similar values

  • Compute the eigenvector of the modularity matrix corresponding to its largest positive eigenvalue

  • Separate vertices into two communities based on the sign of the corresponding element in the eigenvector. If all elements in the eigenvector have the same sign, the network has no underlying community structure

  • Choose the partition that maximises the G-N modularity
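The spectral bisection step can be sketched in a few lines of base R on a hypothetical toy graph (two triangles joined by one edge; made-up data, not the karate club):

```r
# Toy graph: two triangles (nodes 1-3 and 4-6) joined by the single edge 3-4
A <- matrix(0, 6, 6)
edges <- rbind(c(1,2), c(1,3), c(2,3), c(4,5), c(4,6), c(5,6), c(3,4))
for (e in seq_len(nrow(edges))) {
  A[edges[e,1], edges[e,2]] <- A[edges[e,2], edges[e,1]] <- 1
}

k <- rowSums(A)                 # node degrees
B <- A - outer(k, k) / sum(A)   # modularity matrix: B = A - k k^T / 2m
v <- eigen(B, symmetric = TRUE)$vectors[, 1]  # eigenvector of the largest eigenvalue

split(1:6, v > 0)  # bisect nodes by sign: recovers the two triangles
```

The sign of the eigenvector is arbitrary, so which triangle lands in which group can vary, but the two sign-based groups match the two triangles.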

Performance

  • Fast if select only a few e-vectors to compute, although slower than greedy

  • Good results

Weighted graphs? No

Directed graphs? No

#------------
#Leading eigenvector: Newman spectral approach
#------------
    alg_name='Spectral'
    spectral_c <- leading.eigenvector.community(g_karate)
    comm=spectral_c
    #look at result of algorithm
    print(comm)
## IGRAPH clustering leading eigenvector, groups: 4, mod: 0.39
## + groups:
##   $`1`
##   [1]  1  5  6  7 11 12 17
##   
##   $`2`
##    [1]  9 10 15 16 19 21 23 27 30 31 33 34
##   
##   $`3`
##   [1]  2  3  4  8 13 14 18 20 22
##   
##   $`4`
##   + ... omitted several groups/vertices
    #the process aims at maximising the modularity
    modularity(comm)
## [1] 0.3934089
    # number of communities and their sizes
    nc=length(comm)
    sizes(comm)
## Community sizes
##  1  2  3  4 
##  7 12  9  6
    #if you need the membership vector only
    mem_vec=membership(comm)
    #visualise the results
    plot(comm,g_karate)#,layout=layout.fruchterman.reingold)
    title(paste0(alg_name,' algorithm \n',nc,' communities for the karate club'))

    #Doesn't get a dendrogram

Louvain: Blondel et al, modularity optimization

V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre, J. Stat. Mech.: Theory Exp. 2008, P10008.

http://arxiv.org/abs/arXiv:0803.0476

iGraph: multilevel.community

http://igraph.org/r/doc/multilevel.community.html

http://www.inside-r.org/packages/cran/igraph/docs/multilevel.community

How does it work?

  • Local optimization of the G-N modularity in the neighbourhood of each node

  • Start from the whole network, identify communities, replace each community by a super-node to obtain a smaller weighted network, and repeat

  • Iterate until modularity stable

Performance

  • Excellent

  • Computational complexity linear in the number of nodes

Weighted graphs? Yes

Directed graphs? No

#------------
#Louvain method: Blondel et al, modularity optimization
#------------
    alg_name='Louvain'
    louv_c <- multilevel.community(g_karate)
    comm=louv_c
    #look at result of algorithm
    print(comm)
## IGRAPH clustering multi level, groups: 4, mod: 0.42
## + groups:
##   $`1`
##   [1]  5  6  7 11 17
##   
##   $`2`
##    [1]  1  2  3  4  8 10 12 13 14 18 20 22
##   
##   $`3`
##   [1] 24 25 26 28 29 32
##   
##   $`4`
##   + ... omitted several groups/vertices
    #the process aims at maximising the modularity
    modularity(comm)
## [1] 0.4188034
    # number of communities and their sizes
    nc=length(comm)
    sizes(comm)
## Community sizes
##  1  2  3  4 
##  5 12  6 11
    #if you need the membership vector only
    mem_vec=membership(comm)
    #visualise the results
    plot(comm,g_karate)#,layout=layout.fruchterman.reingold)
    title(paste0(alg_name,' algorithm \n',nc,' communities for the karate club'))

Infomap: Rosvall & Bergstrom

M. Rosvall and C. T. Bergstrom, Maps of information flow reveal community structure in complex networks, PNAS 105, 1118 (2008)

M. Rosvall, D. Axelsson, and C. T. Bergstrom, The map equation, Eur. Phys. J. Special Topics 178, 13 (2009).

iGraph: infomap.community

http://igraph.org/r/doc/infomap.html

http://www.inside-r.org/packages/cran/igraph/docs/infomap.community

How does it work?

Find community structure that minimizes the expected description length of a random walker trajectory

Performance

  • Fast
  • Excellent, best performance

Weighted graphs? Yes

Directed graphs? Yes

#------------
#Infomap method: Rosvall and Bergstrom
#------------
    alg_name='Infomap'
    info_c <- infomap.community(g_karate)
    comm=info_c
    #look at result of algorithm
    print(comm)
## IGRAPH clustering infomap, groups: 3, mod: 0.4
## + groups:
##   $`1`
##    [1]  9 15 16 19 21 23 24 25 26 27 28 29 30 31 32 33 34
##   
##   $`2`
##    [1]  1  2  3  4  8 10 12 13 14 18 20 22
##   
##   $`3`
##   [1]  5  6  7 11 17
## 
    #Infomap minimises the description length of a random walk rather than maximising modularity, but we can still report the modularity of the resulting partition
    modularity(comm)
## [1] 0.4020381
    # number of communities and their sizes
    nc=length(comm)
    sizes(comm)
## Community sizes
##  1  2  3 
## 17 12  5
    #if you need the membership vector only
    mem_vec=membership(comm)
    #visualise the results
    plot(comm,g_karate)#,layout=layout.fruchterman.reingold)
    title(paste0(alg_name,' algorithm \n',nc,' communities for the karate club'))

include_graphics(".\\img\\networks2\\table_mod.png")

include_graphics(".\\img\\networks2\\Expert.png")

include_graphics(".\\img\\networks2\\Expert2.png")

3 Networks 3 Neave


3.1 Learning Objectives

This session will cover the following:

  • How trade data can be used to model the capabilities of countries
  • Calculating Revealed Comparative Advantage (RCA) for a country and product
  • Introduce The Product Space and using RCA to compute the proximity between products
  • Identifying products a country has the capacity to succeed at exporting

The code presented here follows the methods used in The Product Space Conditions the Development of Nations by C. A. Hidalgo, B. Klinger, A.-L. Barabási and R. Hausmann, Science, Jul 2007

It’s a great paper and worth a read.







3.2 Trade Data

You can access world import and export data from the United Nations here: https://cid.econ.ucdavis.edu/nberus.html.

See page 48 of the documentation (.\data\wtf99\NBER-UN_Data_Documentation_w11040.pdf) for an explanation of the variables.

# Load world import and export data from 1999. We restrict our analysis to a single year for ease but it is better to use all available years.
dfWorldTrade <- read.dta13(".\\data\\wtf99\\wtf99.dta")
head(dfWorldTrade)
##   year  icode importer  ecode exporter sitc4 unit dot   value quantity
## 1 1999 100000    World 100000    World             NA 3854744       NA
## 2 1999 100000    World 100000    World  0011       NA 1041246       NA
## 3 1999 100000    World 100000    World  0011    N  NA  388392  1970655
## 4 1999 100000    World 100000    World  0011    W  NA 2537670  1357983
## 5 1999 100000    World 100000    World  0012       NA   36635       NA
## 6 1999 100000    World 100000    World  0012    N  NA   36197   669442

Products are differentiated by their sitc4 code

# 1263 different product codes, including 1 blank code, so 1262 in practice
length(unique(dfWorldTrade$sitc4))
## [1] 1263
# 188 unique exporter countries
length(unique(dfWorldTrade$ecode))
## [1] 188

Data cleaning

# Remove invalid product codes and select only the columns we are interested in
dfWorldTrade <- dfWorldTrade[dfWorldTrade$sitc4 !="",c("ecode","sitc4","value")]

# Remove entries where the exporter is 'World'. The World category is assigned to imports/exports with missing data.
dfWorldTrade <- dfWorldTrade[dfWorldTrade$ecode != "100000",]

head(dfWorldTrade)
##       ecode sitc4  value
## 2301 117100  0011   2262
## 2302 117100  0012    131
## 2303 117100  0014    556
## 2304 117100  0015   2422
## 2305 117100  0111    278
## 2306 117100  0111 117149


3.3 Revealed Comparative Advantage

Group by country and product to calculate the value of each country's exports of each product, each country's total exports, and each product's global total

# Aggregate the data
dfWTCountryAgg <- aggregate(dfWorldTrade$value, by = list(dfWorldTrade$ecode), sum, na.rm = TRUE)
dfWTCountryProdAgg <- aggregate(dfWorldTrade$value, by = list(dfWorldTrade$ecode, dfWorldTrade$sitc4), sum, na.rm = TRUE)
dfWTProductAgg <- aggregate(dfWorldTrade$value, by = list(dfWorldTrade$sitc4), sum, na.rm = TRUE)
globalSum <- sum(dfWorldTrade$value)

# Rename the columns
colnames(dfWTCountryAgg) <- c("ecode", "etotal")
colnames(dfWTProductAgg) <- c("sitc4", "ptotal")
colnames(dfWTCountryProdAgg) <- c("ecode", "sitc4", "eptotal")
head(dfWTCountryAgg)
##    ecode   etotal
## 1 117100 63359546
## 2 130120 26996016
## 3 134340 17184250
## 4 135040 16460714
## 5 137360  1173686
## 6 137880 12947674
head(dfWTProductAgg)
##   sitc4  ptotal
## 1  0011 7934616
## 2  0012 1597522
## 3  0013 2496742
## 4  0014 1814336
## 5  0015 3247458
## 6  0019    1920
head(dfWTCountryProdAgg)
##    ecode sitc4 eptotal
## 1 117100  0011    4524
## 2 167060  0011   16028
## 3 211240  0011 1456330
## 4 218400  0011  422514
## 5 330320  0011    7852
## 6 330760  0011    1982

Join these datasets together and use this to calculate RCA

dfRCA <- merge(dfWTCountryProdAgg, dfWTCountryAgg, by ="ecode")
dfRCA <- merge(dfRCA, dfWTProductAgg, by="sitc4")
dfRCA$prod_in_country_share <-  dfRCA$eptotal / dfRCA$etotal
dfRCA$prod_in_global_share <- dfRCA$ptotal / globalSum
dfRCA$rca = dfRCA$prod_in_country_share / dfRCA$prod_in_global_share
head(dfRCA)
##   sitc4  ecode eptotal    etotal  ptotal prod_in_country_share
## 1  0011 117100    4524  63359546 7934616          7.140203e-05
## 2  0011 457020     646 178684928 7934616          3.615302e-06
## 3  0011 532500 2423430 601149100 7934616          4.031329e-03
## 4  0011 338580   25338   4825320 7934616          5.251051e-03
## 5  0011 451160    5612   2336744 7934616          2.401632e-03
## 6  0011 530560  300186 287522372 7934616          1.044044e-03
##   prod_in_global_share         rca
## 1         0.0006855073 0.104159392
## 2         0.0006855073 0.005273907
## 3         0.0006855073 5.880796810
## 4         0.0006855073 7.660094153
## 5         0.0006855073 3.503437883
## 6         0.0006855073 1.523023789

RCA > 1 means that this product makes up a greater share of this country’s exports than it does of total world exports, i.e. the country exports the product more intensively than the “average” country.
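In symbols (following Hidalgo et al., 2007), writing \(x_{c,p}\) for the value of country \(c\)’s exports of product \(p\), the quantity computed above is:

\[RCA_{c,p} = \frac{x_{c,p} \, / \, \sum_{p'} x_{c,p'}}{\sum_{c'} x_{c',p} \, / \, \sum_{c',p'} x_{c',p'}}\]

The numerator is prod_in_country_share and the denominator is prod_in_global_share in the code above.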






3.4 The Product Space

To calculate the proximity between products we use RCA as an indication of whether a country exports a product or not. If a country has an RCA > 1 for a product it is considered to export that product.

Using this, the conditional probability of a country exporting one product given that it exports another is:

\[P(product_{i} | product_{j}) = \frac{\text{number of countries exporting i and j}}{\text{number of countries exporting j}}\]
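As a sanity check on this formula, the conditional probabilities can also be computed in one step from a binary country-by-product export matrix. A base-R sketch with made-up data (three hypothetical products "i", "j", "k"; not the trade data used below):

```r
# Rows = countries, columns = products; 1 means RCA > 1 for that product
M <- matrix(c(1, 1, 0,
              1, 0, 1,
              1, 1, 1,
              0, 1, 0), nrow = 4, byrow = TRUE,
            dimnames = list(NULL, c("i", "j", "k")))

joint <- crossprod(M)   # joint[i, j] = number of countries exporting both i and j
nexp  <- diag(joint)    # number of countries exporting each product

# condP[i, j] = P(export i | export j): divide column j by the number exporting j
condP <- joint / matrix(nexp, 3, 3, byrow = TRUE)

condP["i", "j"]  # 2 countries export both i and j, out of 3 exporting j
## [1] 0.6666667
```

Note the matrix is not symmetric: condP["i", "k"] is 1 (both exporters of k also export i) while condP["k", "i"] is 2/3, which is why the proximity calculation below takes the minimum of the two directions.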

# Number of countries exporting each product
dfRCA$is_exported <- as.numeric(dfRCA$rca > 1)
dfNExport <- aggregate(dfRCA$is_exported, by = list(dfRCA$sitc4), sum)
colnames(dfNExport)[2] <- "nexporters"
head(dfNExport)
##   Group.1 nexporters
## 1    0011         21
## 2    0012         24
## 3    0013         13
## 4    0014         15
## 5    0015         20
## 6    0019          4
# Number of countries exporting each product pair
# This block takes a long time to run. To save time (and avoid crashing your computer), you can read the data from csv using the next block instead.

# Merge the RCA table with itself on country code to pair up products exported by the same country, then count the number of countries that export both products in each pair
#dfPairs <- merge(dfRCA, dfRCA, by = "ecode")
#dfPairs$joint_export <-  as.numeric((dfPairs$is_exported.x == 1) & (dfPairs$is_exported.y ==1))
#dfNJointExport <- aggregate(dfPairs$joint_export, by = list(dfPairs$sitc4.x,dfPairs$sitc4.y), sum)
#colnames(dfNJointExport)[3] <- "njointexporters"
#write.csv(dfNJointExport, ".\\data\\product_joint_export_counts.csv", row.names=FALSE)
#head(dfNJointExport)
dfNJointExport <- read.csv(".\\data\\product_joint_export_counts.csv")
head(dfNJointExport)
##   Group.1 Group.2 njointexporters
## 1    0011    0011              21
## 2    0012    0011               9
## 3    0013    0011               5
## 4    0014    0011               4
## 5    0015    0011               6
## 6    0019    0011               1

Use these two tables to compute the conditional probabilities

dfConditional <- merge(dfNJointExport, dfNExport, by = 'Group.1')
dfConditional$pij <- dfConditional$njointexporters / dfConditional$nexporters
head(dfConditional)
##   Group.1 Group.2 njointexporters nexporters        pij
## 1    0011    0011              21         21 1.00000000
## 2    0011    6424               7         21 0.33333333
## 3    0011    01AA               1         21 0.04761905
## 4    0011    2772               3         21 0.14285714
## 5    0011    4313               5         21 0.23809524
## 6    0011    842X               1         21 0.04761905

Now identify and select the minimum of each pair of conditional probabilities by joining the data frame of conditional probabilities to itself.
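The minimum is taken because the conditional probabilities are asymmetric; the proximity defined in Hidalgo et al. (2007) is:

\[\phi_{ij} = \min\{P(product_{i} | product_{j}),\; P(product_{j} | product_{i})\}\]

This makes the proximity symmetric and avoids inflated values when one product has very few exporters.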

# Merge the dfConditional dataframe to itself so that we can compare the joint probabilities P(export i | export j) to P(export j | export i)
dfProximity <- merge(dfConditional, dfConditional, by.x = c("Group.1","Group.2"), by.y = c("Group.2", "Group.1"))
dfProximity$pij.x.islower <- dfProximity$pij.x < dfProximity$pij.y
head(dfProximity)
##   Group.1 Group.2 njointexporters.x nexporters.x      pij.x
## 1    0011    0011                21           21 1.00000000
## 2    0011    0012                 9           21 0.42857143
## 3    0011    0013                 5           21 0.23809524
## 4    0011    0014                 4           21 0.19047619
## 5    0011    0015                 6           21 0.28571429
## 6    0011    0019                 1           21 0.04761905
##   njointexporters.y nexporters.y     pij.y pij.x.islower
## 1                21           21 1.0000000         FALSE
## 2                 9           24 0.3750000         FALSE
## 3                 5           13 0.3846154          TRUE
## 4                 4           15 0.2666667          TRUE
## 5                 6           20 0.3000000          TRUE
## 6                 1            4 0.2500000          TRUE
# We can check this merge has given us the right thing by checking the conditional probability value for Group.1 = "0015" and Group.2 = "0011" has been merged with the value for Group.1 = "0011" and Group.2 = "0015"
dfConditional[(dfConditional$Group.1=="0015") & (dfConditional$Group.2 == "0011"),]
##      Group.1 Group.2 njointexporters nexporters pij
## 5393    0015    0011               6         20 0.3
# Now select the lower conditional probabilities. The lower conditional probability is used as the proximity between two products
dfProximity1 <- dfProximity[dfProximity$pij.x.islower ==TRUE, c("Group.1", "Group.2", "pij.x")]
dfProximity2 <- dfProximity[dfProximity$pij.x.islower ==FALSE, c("Group.1", "Group.2", "pij.y")]

colnames(dfProximity1) <- c("product1","product2", "proximity")
colnames(dfProximity2) <- c("product1","product2", "proximity")

dfProximity <- rbind(dfProximity1, dfProximity2)

# Drop entries where product1 == product2 as the proximity is trivially 1
dfProximity <- dfProximity[dfProximity$product1!=dfProximity$product2,]

# Check for duplicates
any(duplicated(dfProximity[,1:2]))
## [1] FALSE

This dataframe can be used to build a network where the nodes are the products and the edge weights between products are the proximity.

head(dfProximity)
##    product1 product2  proximity
## 3      0011     0013 0.23809524
## 4      0011     0014 0.19047619
## 5      0011     0015 0.28571429
## 6      0011     0019 0.04761905
## 10     0011     0112 0.19047619
## 11     0011     0113 0.38095238

Create an iGraph object from this edge list

graphProdSpace <- graph_from_data_frame(dfProximity, directed = FALSE)

This is a very dense graph so plotting the whole thing will not be helpful. In the Hidalgo (2007) paper the authors describe the variety of methods they used to represent the network visually and produce a figure like that in the slide below. See the supporting online material for details: https://science.sciencemag.org/content/sci/suppl/2007/07/26/317.5837.482.DC1/Hidalgo.SOM.pdf





3.5 Diversification

Calculate ‘density’ around products for a single country (eg UK)
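In Hidalgo et al. (2007) the density around product \(i\) for a country is the proximity-weighted fraction of \(i\)’s neighbours in the product space that the country already exports. Writing \(\phi_{ij}\) for the proximity between products and \(x_{j} = 1\) if the country exports product \(j\) (0 otherwise):

\[\omega_{i} = \frac{\sum_{j} x_{j}\,\phi_{ij}}{\sum_{j} \phi_{ij}}\]

The numerator is computed below as strengthUK and the denominator as strength.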

# First subset the product space to consider only the products exported by a particular country, eg the UK

# The data documentation (.\\data\\wtf99\\NBER-UN_Data_Documentation_w11040.pdf) contains a lookup from country name to country code. The UK is 538260

# Get all products exported by the UK
ukExports <- dfRCA[(dfRCA$is_exported == 1) & (dfRCA$ecode == "538260"),"sitc4"]

# Using dfProximity as the edge list, remove connections to nodes not exported by the UK
dfProximityUK <- dfProximity[dfProximity$product2 %in% ukExports,]

# Now aggregate by product to get the weight contribution from UK exported products only
dfStrengthUK <- aggregate(dfProximityUK$proximity, by = list(dfProximityUK$product1), sum)
colnames(dfStrengthUK) <- c("product","strengthUK")
head(dfStrengthUK)
##   product strengthUK
## 1    0011   52.28548
## 2    0012   54.28734
## 3    0013   47.37124
## 4    0014   82.83941
## 5    0015   63.04949
## 6    0019   13.22206
# The denominator of the density of a node is the sum of its edge weights, known as its 'strength'. This is similar to the node degree but weighted by the proximities
strengthProdSpace <- strength(graphProdSpace,weights = get.edge.attribute(graphProdSpace)$proximity)

dfStrength <- data.frame(product = names(strengthProdSpace), strength = strengthProdSpace)
rownames(dfStrength) <- c(1:length(dfStrength$product))
head(dfStrength)
##   product  strength
## 1    0011 334.12564
## 2    0012 303.94074
## 3    0013 314.03959
## 4    0014 420.02677
## 5    0015 321.33070
## 6    0019  70.90681
# Now combine together and identify the high density products the UK does not export
dfStrength <- merge(dfStrength, dfStrengthUK)
dfStrength$density <- dfStrength$strengthUK / dfStrength$strength

# Now select just the products that the UK doesn't export and rank by strength
dfUKNotExp <- dfStrength[!(dfStrength$product %in% ukExports),]
head(dfUKNotExp[with(dfUKNotExp, order(density, decreasing = TRUE)),])
##      product  strength strengthUK   density
## 450     524A  58.77814   11.14369 0.1895890
## 1089    791X  58.77814   11.14369 0.1895890
## 430     5169 381.65871   71.24523 0.1866726
## 6       0019  70.90681   13.22206 0.1864710
## 473     5416 411.47924   75.92215 0.1845103
## 1027    7741 317.77398   58.46509 0.1839833

What products are these?

# The descriptions for revision 1 and 2 of the SITC codes can be downloaded from here: https://unstats.un.org/unsd/tradekb/Knowledgebase/50262/Commodity-Indexes-for-the-Standard-International-Trade-Classification-Revision-3

# The codes used here are the 4th revision but we will use the descriptions from the 2nd revision for ease.

# The 4th revision is detailed here: https://unstats.un.org/unsd/publications/catalogue?selectID=104
dfSITC <- read.csv(".\\data\\SITC Rev2.csv")
head(dfSITC)
##   ï..Commodity.Code
## 1                 0
## 2                00
## 3               001
## 4              0011
## 5             00111
## 6             00119
##                                       Commodity.description
## 1                    Food and live animals chiefly for food
## 2                             Live animals chiefly for food
## 3                             Live animals chiefly for food
## 4 Animals of the bovine species (including buffaloes), live
## 5                                 -- pure bred for breeding
## 6                      -- other than pure bred for breeding
# Add product descriptions into the strength dataframe
dfStrength$product_group <- substr(dfStrength$product, start = 1, stop = 2)
dfStrength <- merge(dfStrength, dfSITC, by.x = c("product"), by.y = c("ï..Commodity.Code"))
dfStrength <- merge(dfStrength, dfSITC, by.x = c("product_group"), by.y = c("ï..Commodity.Code"))
colnames(dfStrength)[6:7] <- c("product_description","group_description")
head(dfStrength)
##   product_group product  strength strengthUK   density
## 1            00    0011 334.12564   52.28548 0.1564845
## 2            00    0012 303.94074   54.28734 0.1786116
## 3            00    0013 314.03959   47.37124 0.1508448
## 4            00    0014 420.02677   82.83941 0.1972241
## 5            00    0015 321.33070   63.04949 0.1962137
## 6            00    0019  70.90681   13.22206 0.1864710
##                                         product_description
## 1 Animals of the bovine species (including buffaloes), live
## 2                                     Sheep and goats, live
## 3                                               Swine, live
## 4                                             Poultry, live
## 5                                      Equine species, live
## 6    Live animals of a kind mainly used for human food, nes
##               group_description
## 1 Live animals chiefly for food
## 2 Live animals chiefly for food
## 3 Live animals chiefly for food
## 4 Live animals chiefly for food
## 5 Live animals chiefly for food
## 6 Live animals chiefly for food
# Now select just the products that the UK doesn't export and rank by strength
dfUKNotExp <- dfStrength[!(dfStrength$product %in% ukExports),]
head(dfUKNotExp[with(dfUKNotExp, order(density, decreasing = TRUE)),])
##     product_group product  strength strengthUK   density
## 268            51    5169 381.65871   71.24523 0.1866726
## 6              00    0019  70.90681   13.22206 0.1864710
## 292            54    5416 411.47924   75.92215 0.1845103
## 633            77    7741 317.77398   58.46509 0.1839833
## 715            87    8720 297.62421   54.32069 0.1825144
## 634            77    7742 347.62119   63.27810 0.1820318
##                                             product_description
## 268                                      Organic chemicals, nes
## 6        Live animals of a kind mainly used for human food, nes
## 292 Glycosides, glands, antisera, vaccines and similar products
## 633                                   Electro-medical equipment
## 715                     Medical instruments and appliances, nes
## 634  X-ray apparatus and equipment; accessories; and parts, nes
##                                                     group_description
## 268                                                 Organic chemicals
## 6                                       Live animals chiefly for food
## 292                             Medicinal and pharmaceutical products
## 633 Electric machinery, apparatus and appliances, nes, and parts, nes
## 715 Professional, scientific, controlling instruments, apparatus, nes
## 634 Electric machinery, apparatus and appliances, nes, and parts, nes
# Which products that the UK *does* export have the highest density?
dfUKExp <- dfStrength[dfStrength$product %in% ukExports,]
head(dfUKExp[with(dfUKExp, order(density, decreasing = TRUE)),])
##     product_group product  strength strengthUK   density
## 188            28    2860  98.87077   33.12124 0.3349953
## 606            75    7521 182.50646   49.63667 0.2719721
## 535            71    7149 224.58335   57.67624 0.2568144
## 674            79    7928 208.63288   53.46280 0.2562530
## 675            79    7929 215.60783   54.39888 0.2523047
## 121            22    2234 194.68738   48.29262 0.2480521
##                                                  product_description
## 188                     Ores and concentrates of uranium and thorium
## 606                     Analogue and hybrid data processing machines
## 535 Parts, nes of the engines and motors of group 714 and item 71888
## 674                           Aircraft, nes and associated equipment
## 675                        Parts, nes of the aircraft of heading 792
## 121                                                          Linseed
##                                           group_description
## 188                      Metalliferous ores and metal scrap
## 606 Office machines and automatic data processing equipment
## 535                Power generating machinery and equipment
## 674                               Other transport equipment
## 675                               Other transport equipment
## 121                          Oil seeds and oleaginous fruit